Network-based confidence scoring system for genome-scale metabolic reconstructions 
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Reliability on complex biological networks reconstructions remains a concern. Although observa- 
tions are getting more and more precise, the data collection process is yet error prone and the proofs 
display uneven certitude. In the case of metabolic networks, the currently employed confidence scor- 
ing system rates reactions according to a discretized small set of labels denoting different levels of 
experimental evidence or model-based likelihood. Here, we propose a computational network-based 
system of reaction scoring that exploits the complex hierarchical structure and the statistical reg- 
ularities of the metabolic network as a bipartite graph. We use the example of Escherichia coli 
metabolism to illustrate our methodology. Our model is adjusted to the observations in order to de- 
rive connection probabilities between individual metabolite-reaction pairs and, after validation, we 
integrate individual link information to assess the reliability of each reaction in probabilistic terms. 
This network-based scoring system breaks the degeneracy of currently employed scores, enables fur- 
ther confirmation of modeling results, uncovers very specific reactions that could be functionally or 
evolutionary important, and identifies prominent experimental targets for further verification. We 
foresee a wide range of potential applications of our approach given the natural network bipartivity 
of many biological interactions. 



I. INTRODUCTION 

Cells are self-organized entities able to carry out spe- 
cialized functions at several interrelated levels, i.e. a 
genetic code is repeatedly transcribed for cell mainte- 
nance and reproduction, a proteomic set robustly em- 
bodies the cell machinery, and a mesh of metabolic reac- 
tions continuously furnishes the energy and biochemical 
compounds necessary for life. A crucial milestone to un- 
derstand and control cellular behavior is the building up 
of reliable reconstructions of the interactions spanning 
the different functional levels [IHS]. Such reconstruc- 
tions find a natural abstraction in the form of complex 
networks [4-6J, where nodes represent cellular compo- 
nents, such as genes, proteins or metabolites, while edges 
identify the presence of biological interactions between 
them |7J. These network representations enable to map 
the large-scale structure of cellular interactions [8j[9], to 
explore the basic principles of transcriptome and pro- 
teome organization [9], to identify missing genes encod- 
ing for specific metabolic functions [TO], and to analyze 
emergent global phenomena in metabolism like robust- 
ness and regulation [TTHT4] . 

At present, the information for complex network repre- 
sentations of cellular systems comes primarily from web- 
based databases, oftentimes manually curated with in- 
formation from multiple sources, like annotations from 
the literature or new experiments [15]. It is common 
to take the reliability of these data for granted and to 
draw from them resolute inferences about the properties 
or the behavior of the investigated organisms [16J. Al- 
though, in general, observations are getting more and 
more precise, uncertainties about components or inter- 
actions remain jT7} [T8j : experimental targets are many 
times biased towards the most rewarding in terms of ex- 
pected impact, experimental evidence gathered with dif- 
ferent methodologies is not always of the same quality, 
variability is unavoidably present in different organisms 



of the same species, and perfect environmental control 
in experiments is often difficult to achieve. In particu- 
lar, high-throughput techniques produce massive data in 
comparison with more dedicated experiments at the price 
of repeatedly reported inaccuracy [19J. In the best case, 
a number of alternative or missing interactions need to 
be inferred and those detected with low confidence need 
to be validated. 

Prediction of missing interactions in probabilistic 
terms is possible on the basis of the structure of the 
network alone, and could serve to better characterize 
network-based descriptions of biological systems and as 
a guide for new experiments. Despite being of enormous 
importance, this task is just a first step towards assessing 
the quality of a given data set. Metabolisms, in particu- 
lar, encode information through the metabolic processes 
themselves, either singled out or combined in pathways. 
It would be thus convenient to assess the realiability of a 
reconstructed metabolic network in terms of a reaction- 
based "confidence scoring system". The latter, although 
it may appear at first sight "chemically spurious" (canoni- 
cally a chemical reaction either exists or does not), is fully 
admitted in the context of prone-to-error large-scale re- 
constructed biochemical networks [20| |2T] . In short, it 
amounts to assign a certain level of experimental evi- 
dence, or model-based likelihood, to every particular re- 
action within a genome-scale metabolic reconstruction. 

For network analysts, scoring a biochemical process 
in terms of its confidence, apart from being experimen- 
tally motivating, is a computational challenge. Starting 
from the assessment of individual links, as it has been 
devised for instance for protein-protein interaction net- 
works [22j [23] , the problem is conditioned by the need of 
non-local models that faithfully capture the large-scale 
statistical regularities of the networks, and by the re- 
quirement of sufficient reliability in the input experimen- 
tal observations. In this respect, the reconstruction of 
genome-scale biochemical networks is a well-established 
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procedure down to stoichiometric detail and available 
metabolic network reconstructions, such as those of E. 
coli, are of sufficient quality and can be exploited to get 
accurate network-based predictions. 

Here, we propose a computational network-based sys- 
tem of reaction scoring and, subsequently, apply it to 
the metabolism of E. coli. To this end, we introduce a 
link prediction method that exploits the complex hier- 
archical structure and the statistical regularities of the 
metabolic network [24| [25] and takes explicitly into ac- 
count its bipartite nature [26, 27J. Our model is adjusted 
to the observations in order to derive connection proba- 
bilities between pairs of elements [28— 30J and, after val- 
idation, we integrate individual metabolite-reaction link 
information to assess the reliability of each reaction in the 
metabolism of E. coli, both in absolute terms and relative 
to a random null model. Such, for the first time proposed, 
network-based scoring system permits to rank reactions 
in the database, and compare our "in silico" predictions 
with "in vivo" annotations quoted in the reference liter- 
ature [21 j. In this latter context, confidence scores are 
discretized according to five integers. From a direct com- 
parison with our continuously distributed index, similar- 
ities and discrepancies are fleshed out, singularly stress- 
ing the specifities of particular reactions as compared to 
those that are baseline. At the end, our versatile ap- 
proach may provide unexisting information for unranked 
databases, or refine and complement labeled ones, aiming 
to direct more effectively experimental efforts and to un- 
veil structural levels of organization within the intrincate 
biological networks underlying living organisms. 



II. RESULTS AND DISCUSSION 
A. Network-based confidence scoring system 

Beyond experimental evidence, it is possible to assess 
confidence scores for reactions in genome-scale metabolic 
reconstructions using theoretical models. Given an ex- 
perimentally derived metabolic reconstruction, a confi- 
dence score for the reactions can be computed on the 
basis of a suited stochastic network-based model, as pro- 
posed in next section. In the context of bipartite network 
representations [23 , the model exploits the statistical 
regularities that imprint the structure of the metabolic 
network in order to ascertaining how well individual links 
between metabolites and reactions fit the observed topo- 
logical patterns. In this way, it is possible to predict 
probabilities of connection for all potential interactions, 
those already present in the reconstruction and those ab- 
sent. These probabilities are further integrated for the 
specific combination of metabolites involved in any par- 
ticular reaction. 

To define this network-based confidence score in quan- 
titative terms, we interpret that a reaction is equivalent 
to the univocal combination of its associated metabo- 
lites, and that every metabolite m has a probability p mr 



1 glucose 6-phosphate (G6P) + 1 NADP zwt > 1 6-phosphoglucono d-lactone (6PGL) + 1 NADPH 

pgl 

1 6-phosphoglucono d-lactone (6PGL) + 1 H 2 ¥-1 6-phosphogluconate (6PG) 
gnd 

1 6-phosphogluconate (6PG) + 1 NADP + > 1 ribulose 5-phosphate (R5P) + 1 NADPH 

rpe 

1 ribulose 5-phosphate (R5P) * > 1 xylulose 5-phosphate (X5P) 




Figure 1: Bipartite network representation and cor- 
responding hierarchical dendrogram. We illustrate the 
hierarchical random graph model with four coupled stoichio- 
metric equations in the pentose-phosphate pathway of E. coli, 
taken from [24], represented as a bipartite network. Reaction 
acronyms stand for the catalyzing enzyme: zwf, glucose- 6- 
phosphate dehydrogenase [EC 1. 1.1. 49] ; pgl, 6- phospho- 
gluconolactonase [EC 3. 1. 1.31] ; gnd, 6- phosphogluconate 
dehydrogenase [EC 1.1.1. 43] ; rpe, ribulose- phosphate 3- 
epimerase [EC 5. 1.3. 1]. These equations are represented as 
an unweighted undirected bipartite network formed by con- 
nections (black lines) between reactions (blue squares) and 
metabolites (orange circles) existing whenever one metabo- 
lite participates in a reaction. Notice that metabolites or 
reactions are not connected among themselves. The model 
adjusts an underlying binary tree to the observed network 
structure where each internal node t is associated with a tree 
probability pt that we transform into a distance d t = 1 — pt- 
Every pair reaction-metabolite with minimum common an- 
cestor t is then assumed to be separated a distance d m r = d t 
(the NaN notation indicates that only one class of nodes 
populates the leaves of the child branches and are thus not 
considered) . 

of being associated to a reaction r. Once the connection 
probabilities between all metabolites and all reactions are 
estimated from the model, and assuming their mutual in- 
dependence, the probability that a certain combination 
ji of metabolites co-occur in a reaction r is 

Pfxr = ]^[ Pmr Yl ( 1 "Pm'r), (1) 
men m'^n 

where the subindex m corresponds to metabolites in the 
predefined set \i and m' to those not included. Notice 
that in general the set \i could be different from the actual 
combination of metabolites associated to reaction r. The 
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average number of co-occurrences can be calculated as 
the sum over all reactions 

R 
r=l 

A network-based confidence score for a reaction can then 
be defined as the average number of occurrences n v that 
the model associates to its particular combination of 
metabolites v. Defined in this way, these scores break 
the degeneracy of reactions with identical experimental 
evidence. 



B. The Tree Distance Bipartite (TDB) model 

In the following, we introduce and discuss a stochas- 
tic network-based model to estimate the probabilities p mr 
of connection between metabolites and reactions. Taking 
advantage of their natural bipartite nature, we consider 
network representations where both metabolites and re- 
actions appear as nodes and metabolites are connected 
by edges to the reactions they participate in. We con- 
sider the simplest unweighted undirected representation, 
where substrates and products are not differentiated. See 
Fig. 1 for an example. 

Previous works have shown that the complex organi- 
zation of metabolic networks displays characteristic fea- 
tures shared by other complex networks: short topolog- 
ical diameter [31J, steady state cycles [32J or structural 
robustness [14], for instance. We implement a large-scale 
model that takes advantage of some of those organizing 
principles, in particular the heterogeneity in the number 
of connections per metabolite (degree) [33J and its hierar- 
chical architecture [25], to infer connection probabilities 
between metabolites and reactions. Network-based mod- 
els are usually prescribed in terms of connection rules 
between the nodes. These laws are stated independently 
of observed systems to produce graphical representations 
that summarize their topological structure. Notice that 
here, in contrast, we compute from the observed data 
the set of connection probabilities that has the maximum 
likelihood to reproduce the structure of an experimental 
metabolic reconstruction, so we are solving the inverse 
problem. 

Our first step is to assume an underlying metric space 
for the metabolic network where all its elements are sup- 
posed that the closer they are, the higher will be their 
connection probability. To this end, we fit a hierarchical 
random graph [29] to the metabolic reconstruction, once 
represented as a bipartite network with M metabolites 
and R reactions (see Fig. 1). More specifically, we ad- 
just the observed bipartite network to a dendrogram, or 
binary tree structure T, where metabolites and reactions 
appear as distinguishable leafs. This tree represents the 
underlying metric space, and each of the M + R— 1 inter- 
nal nodes t in the dendrogram has an associated distance 
d tl so that each pair metabolite-reaction for which t is 



the lowest common ancestor is separated by a distance 
in the tree d mr = d t , independently from whether the 
link actually exists in the network or not. We find these 
tree distances by fitting the tree to the observed network 
data combining a maximum-likelihood approach with a 
Monte Carlo sampling method that explores the space of 
all possible dendrograms (see Appendix A). Our results 
are based on intensive numerical simulations that average 
a large number of samples in the stationary state when 
changes in the form of the dendrogram do not modify the 
likelihood function beyond fluctuations. 

Once a distance d mr is associated to every possible 
pair metabolite-reaction, we correct for heterogeneity in 
the degrees of metabolites and compute the connection 
probability between metabolite m and reaction r as 

where k m is the degree of the metabolite. As a result, our 
model produces a list of estimated connection probabil- 
ities p mr between all possible combinations metabolite- 
reaction. The confidence score for every specific reaction 
can then be computed by applying Eq. ([I]) and Eq. 

C. TDB confidence scores for E. coli metabolism 

As an application of this methodology, we analyze 
the iAF1260 version of the K12 MG1655 strain of the 
metabolism of E. coli |34j provided in the BIGG database 
(http:/ /bigg.ucsd. edu/| ). From the empirical data, we 
built a bipartite network representation (see Appendix 
C) avoiding reactions that do not involve direct chem- 
ical transformation (we obviate isomerization, diffusion 
and exchange reactions) . This leads to a bipartite graph 
of 1479 reactions and 976 metabolites. As expected, 
the number of metabolites entering into a reaction k r 
follows a nearly homogeneous distribution with mean 
< k r >= 4.82 and mode 5. In contrast, the number k m 
of reactions in which a metabolite participates displays a 
scale free degree distribution P(k m ) ~ k m 2A , with an av- 
erage degree < k m >= 7.30 (see Appendix C). Currency 
metabolites are the most connected substrates, some with 
more than a hundred and up to 841 connections (fo + , /i2o, 
atp, pi, adp, ppi, nad, nadh). 

We obtain the TDB based probabilities p mr for this E. 
coli reconstruction, and we evaluate their reliability in 
statistical terms. We compute the probability that the 
model associates a higher likelihood to a missing link that 
has been removed from the bipartite metabolic network 
than to a non-existing link that was never there. To this 
end, a subset of links in the original network is removed 
uniformly at random and a new set of connection prob- 
abilities is calculated on the basis of the remaining part 
of the network. The new probabilities associated to re- 
moved connections are compared one by one to that of 
non-existing links. This reliability statistic ranges from 
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Figure 2: Evaluation of the reliability of the Tree Dis- 
tance Bipartite model. Top. Link reliability is calculated 
in statistical terms as the probability that a randomly chosen 
removed connection has higher estimated connection proba- 
bility by our method than a randomly chosen pair of uncon- 
nected reaction-metabolite. The TDB model presented here 
is compared against the CMB model, the HRBG model and 
the CN model (see Appendix B). Bottom. Distribution of 
removed links in the set of potential links ranked from more to 
less probable according to the model. Notice that the higher 
the number of removed links in the original network, the lower 
the total number of removed links in the list of potential pairs. 
This is due to the fact that the removal of all the links of a 
node is equivalent to the removal of the node itself and then 
it does not add to the list of potential connections. 



0.5 to 1 and indicates how much better our method per- 
forms as compared to a by chance baseline accuracy of 
0.5. 

Figure 2 (top) shows the reliability index for differ- 
ent fractions of links in the removed set. When 1% of 
the 7127 links in the network are removed, it takes a 
value of 0.87, meaning that 87% of the times removed 
links are ranked higher in probability by our model than 
non-existing links. In the same plot, we compared with 
alternative models (see Appendix B), such as the Config- 
uration Model for Bipartite networks [2l[35j[36] (CMB), 
the Hierarchical Random Graph model [29J generalized to 
bipartite networks (HRBG), and a local approach based 
on the computation of Common Neighbors (CN) [37] (re- 
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Figure 3: TDB confidence scores and comparison with 
CMB confidence scores. Top. The TDB confidence scores 
Stdb have been ranked in decreasing order. The higher val- 
ues form a short plateau and correspond to outlier values as- 
sociated with reactions that only involve carrier metabolites 
(hubs). Bottom. Ratio of TDB confidence scores Stdb to 
CMB confidence scores Scmb, ranked in decreasing order and 
compared to the baseline value equal to 1 (red dashed line). 
If E = Stdb / Scmb > 1 (< 1) the network displays distance 
correlations (anticorrelations) which are absent in the random 
case. 



actions) between pairs of metabolites. We find that our 
TDB model outperforms all other strategies at identify- 
ing missing interactions. For each point in Fig. 2 (top), 
we also measured the positions of the removed links in 
the ranking of potential links ordered from more to less 
probable according to the model. The results are shown 
in Fig. 2 (bottom). It can be observed that the top 1% 
of the predictions is highly accurate in the sense that it 
contains nearly 40% of all removed connections. As ex- 
pected, this value decreases as less links are available to 
the algorithm in the remaining part of the network, but it 
is noticeable that, even with a 70% of the links removed 
from the original network, the accumulation of those at 
the top 1% as ranked by the algorithm is still of more 
than 20%. 

In view of these results, we accept the accuracy at the 
statistical level of the predicted probabilities p mr and 
we use them to compute theoretical confidence scores 
Stdb = n u (p^ B ) following Eq. Q and Eq. (§ (the 
detailed list of TDB confidence scores is available un- 
der request). Very high values of the TDB confidence 
scores are typically associated to non-specific reactions 
dominated by carrier metabolites (hubs in network-based 
terms). At the top of the rank, the five reactions with 
the highest values form a group of outliers (short first 
plateau in the top graph of Fig. [3| corresponding to re- 
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actions whose metabolites are exclusively hubs. These 
non-specific reactions are at the end of catabolic chains 
and are likely to be shared by many different organisms. 
At the other extreme of the spectrum, low values of St db 
are associated with very specific reactions involving rare 
metabolites, in the sense that they enter a small number 
of reactions. In between, Stdb scores adopt a broad dis- 
tribution of continuous values ranging several orders of 
magnitude. 



D. Comparison beween Database (DB) and Tree 
Distance Bipartite (TDB) scoring systems 

In the database, every reaction (except for exchanges 
and transports through outer membrane) is endowed 
with a confidence score Sdb assessing its level of evi- 
dence. These values are discrete and range from 4 at 
the top, when there is direct biochemical proof, to at 
the bottom, when the reaction is included with no ex- 
perimental evidence, only because it improves modeling 
results. In between, values of 3 correspond to genomic 
evidence, level 2 refer to sequence homology evidence, 
and 1 stands for physiological evidence. This confidence 
scoring system presents some shortcomings, one being 
the wicked level of degeneracy implicit in the use of only 
five discrete values for lists of hundreds or thousands of 
reactions. 

Our TDB confidence scoring system breaks this de- 
generacy with a continuous spectrum of values. We pre- 
fer to restrict to the extrema of our Stdb spectrum to 
compare against Sdb scores. For reactions with scores 
in the database that indicate a strong experimental evi- 
dence (values 4 or 3), we find both high and low values of 
Stdb scores. The first situation corresponds to our TDB 
model providing complementary computational verifica- 
tion of experimental certainty. The second and more in- 
teresting cathegory uncovers very specific reactions that 
could be functionally or evolutionary important. Exam- 
ples are the five FMNH2- dependent monooxygenase reac- 
tions, the Pyridoxine 5 '-phosphate synthase reaction, or 
the Taurine dioxygenase reaction. Conversely, a weak ex- 
perimental evidence, Sdb scores 2 or 1, but a high value 
of the Stdb score, qualifies the reaction as a good target 
for further experimental verification in standard condi- 
tions. Many examples are found within the transport 
subsystems, like reactions of transport via ABC system 
(iron (II) and (III), phosphatidylglycerol, phosphatidate) . 
If the Stdb score is low, the reaction could be difficult 
to be observed experimentally except for very specific 
environments. Finally, high Stdb scores for reactions 
that where required for modeling, Sdb score 0, denotes 
consistency between our model and steady-state flux op- 
timization solutions. It is worth remarking that these 
reactions appear at the very end of the catabolic chain 
and involve more than one carrier metabolites. How- 
ever, a variety of reactions, like many in the subsystem 
of Cof actor and Prosthetic Group Biosynthesis and many 



with the highest reaction degree, manifest discrepancy 
between the models. 



E. Comparison beween Tree Distance Bipartite 
(TDB) and Configuration Model Bipartite (CMB) 
confidence scores 

Along with absolute confidence scores, we also analyze 
relative scores defined on the basis of the Configuration 
Model for Bipartite networks (CMB) [2l[35j[36] (see Ap- 
pendix B). The latter assumes the actual degree distri- 
butions for metabolites and reactions and it is otherwise 
maximally random in the assembly of connections. For 
every reaction, we calculate the Scmb score represent- 
ing its probability of occurrence according to the config- 
uration model and use this value to compute the ratio 
£ = Stdb / Scmb (the detailed list of relative scores is 
available under request). Since differences between both 
scores are mainly related to the consideration of tree dis- 
tances between metabolites and reactions in the TDB 
model, a relative score £ = Stdb / Scmb > 1 (< 1) 
points to the presence of tree distance correlations (anti- 
correlations) in the bipartite network, which are absent in 
the random case. In other words, a ratio higher (smaller) 
than one for a certain reaction indicates that its metabo- 
lites have a tendency to aggregate (avoid each other) as 
compared to the random case. 

The ranking of relative scores is shown in the bot- 
tom graph of Fig. |3j where several clusters can be dif- 
ferentiated. The first three clusters appear in slightly 
tilted plateaus with levels well separated by appreciable 
jumps. Each of them is formed by a subgroup of reactions 
that, according to the database, tend to belong to the 
same subsystem and share characteristic combinations 
of metabolites. The first group includes the FMNH2- 
dependent monooxygenase reactions, mentioned above as 
highly specific, with flavin mononucleotide and sulfite as 
reactants. The two reactions in the second plateau be- 
long to the Glycerophospholipid Metabolism subsystem 
and are the only two in the database associated to acyl 
phosphatidylglycerol. The third cluster gathers together 
the eleven reactions containing 2-Demethylmenaquinone 
8. It is remarkable that, in general, the reactions in these 
clusters have attached a high DB confidence score. Ex- 
ceptions appear in the third cluster, where one reaction 
has an experimental evidence score less than 3 while four 
reactions are included for modeling reasons and protrude 
as experimental targets for experimental verification. For 
the rest, most of the scores have values above but close to 
one and there are also over two hundred reactions with 
ratios below one. At the very tail, one finds a set of 
reactions that share the common characteristic of being 
those with the highest reaction degree and with weak or 
just modeling evidence. In particular, thiazole phosphate 
synthesis is the solely reaction involving twelve metabo- 
lites in the database and has the lowest relative score 
£ = 0.3, and noticeably also the lowest absolute Stdb 
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score (Sdb = 2). 



F. Metabolic confidence map at the level of 
pathways 

Relative scores, E, conform better than absolute 
scores, Stdb^o the idea of pathways as functional mod- 
ules since they overexpose the effect of tree distances, 
that we expect to be related with the modular organiza- 
tion of the network. We analyze the E confidence score 
map associated to the different biochemical pathways in 
E. coli in Fig. [4j For each pathway, we look at the dis- 
tribution of relative scores for the reactions within [38J. 
Notice that, even for E. coli, there remain a number of 
pathways poorly characterized at the experimental level. 
According to our relative scores, some pathways, such 
as Inorganic Ion Transport and Metabolism or Oxidative 
Phosphorilation, are associated to a relatively high aver- 
age confidence, while others subsystems, such as Alanine 
and Aspartate Metabolism, present average values very 
close to one. At a qualitative level, the overall corre- 
lation between this in silico metabolic confidence map 
and the metabolic landscape from the DB scoring sys- 
tem is noticeable, although the agreement is not perfect. 
In some cases, like for tRNA Charging, Murein Biosyn- 
thesis or Glutamate Metabolism, both maps provide very 
good agreement. In contrast, some pathways are at vari- 
ance, like the Transport, Inner Membrane or Membrane 
Lipid Metabolism, that thus appears as a prominent po- 
tential target for further evaluation and experimental ex- 
ploration. 



III. CONCLUSIONS 

The computational network-based TDB confidence 
scoring system is able to asses, in probabilistic terms, 
the reliability of reactions in metabolic reconstructions 
solely on the basis of the structure of the bipartite inter- 
actions between metabolites and reactions. It relies on a 
link prediction method adjusted to the observations that 
exploits the heterogeneity in the number of connections 
per metabolite and the hierarchical architecture of the 
network to estimate connection probabilites that after- 
wards are integrated at the level of reactions. As a result, 
our TDB scoring system is able to break the degener- 
acy of currently employed scores that only use a discrete 
number of integers to label different levels of empirical 
evidence or model-based likelihood, in addition to pro- 
viding non-existing information for unranked databases. 
TDB scores enable as well further confirmation of steady- 
state modeling solutions, uncover very specific reactions 
that could be functionally or evolutionary important, and 
identify prominent experimental targets for further veri- 
fication. When compared with a random null model that 
justs accounts for heterogeneity in the number of connec- 
tions per element, relative scores detect and quantify the 
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Figure 4: Confidence score maps for biochemical path- 
ways in E. coli. Left. Network-based relative TDB con- 
fidence scoring system. Right Database confidence scoring 
system (DB). Colors represent the percentage of reactions 
within a pathway, as defined in the database, that have a con- 
fidence score with a specific value (in the x axis). Pathways 
are ranked from top to bottom according to the average DB 
confidence score of reactions in the pathway. We clarify that 
the discretization shown for the case of the relative scores E 
is just for illustration purposes, and we emphasize that those 
are continuous and break the degeneracy inherent to the DB 
scores. The used discretization is: value for E < 1.05, value 
1 for 1.05 < E < 1.09, value 2 for 1.09 < E < 1.11, value 3 
for 1.11 < E < 1.145, value 4 for E > 1.145. 



tendency of groups of metabolites to aggregate or disag- 
gregate. This comes out as distance-based correlations 
or anticorrelations in the underlying tree metric space, 
a question worth exploring in the future in relation to 
functional modules. 

In a broader context, many biological interactions find 
a natural representation in the form of bipartite net- 
works. The ubiquity of these bipartite structures in cel- 
lular networks foretells a wide range of potential appli- 
cations of the present methodology, from the estimation 
of codon-gene association probabilities to the assessment 
of protein complexes. 
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Appendix A: Fitting the binary tree to the bipartite 
network 

We adjust to the observed data a binary tree T with 
M + R — 1 internal nodes at its bifurcation points and 
M + R leaves corresponding to the nodes of the metabolic 
bipartite network, so representing its M metabolites and 
its R reactions. Each internal node t has an associated 
tree probability p t that we transform into a distance d t = 
1 — p t . Each pair metabolite-reaction having as lowest 
common ancestor t is then separated by a distance d mr = 
d t . Given a dendrogram, the internal probabilities p t are 
only dependent on the structure of the observed bipartite 
network and can be calculated as the fraction of observed 
connections between leaves in each branch of the internal 
node over the total possible. To find the dendrogram that 
best fits the real data in terms of likelihood, we assume 
that all trees are a priori equally probable and explore 
the space of possibilities using a Markov chain Monte 
Carlo method [39J combined with a maximum likelihood 
approach, following the methodology in [29] . 

In statistical inference, the likelihood £ of a statistical 
model for a certain set of observed data is the probability 
that the model is a correct explanation, and allows us to 
estimate its unknown parameters. For a set of connection 
probabilities p t and taking into account the underlying 
tree, the likelihood function becomes 

£(D, Pt ) = l[p?<(l-p t f<- E <. (Al) 

teD 

As for unipartite graphs, the variable E t stands for the 
number of actual edges in the observed bigraph, in our 
case those that connect metabolites and reactions in the 
bipartite graph with t as their lowest common ancestor in 
T. The variable £ t corresponds to the total possible num- 
ber of such edges given reactions and metabolites in the 
different branches of the common ancestor £, discounting 
internal combinations. In the unipartite case, £ t = L t R tl 
being L t and R t the number of leaves in the left and 
right subtrees rooted at t. In our scheme for bipartite 
networks, £ t = L tm R tr + L tr R tm , counting possible com- 
binations between metabolite leaves in the left subtree 
and reaction leaves in the right subtrees and vice versa. 
Then, p t = E t /S t maximize C. 

Starting from a random configuration, we move among 
all possible sets of dendrograms by performing random 
swaps between one of the branches of a randomly cho- 
sen internal node and the alternative branch at its father 
level. This exploration is appropriate because it is er- 
godic and fulfils detailed balance. The likelihood of the 



new dendrogram produced in this way is computed and 
the dendrogram is accepted or rejected according to the 
standard Metropolis-Hastings rule [39J: the transition is 
accepted whenever the likelihood does not decrease and 
otherwise it is accepted with a probability ex.p(AlogC) 
(for computational purposes, it is more convenient to 
work with the logarithm of the likelihood function). 

After a transient period when C reaches its equilibrium 
value (except typical fluctuations) the system reaches a 
stationary state where we sample over 10 3 dendrograms 
at regular intervals to produce an average measure of 
pt and so of p mr for each possible metabolite-reaction 
pair. This model, the Hierarchical Random Bipar- 
tite Graph (HRBG), is a generalization for bipartite 
networks of the model introduced in |29j . As explained 
in the main text, in our TDB model we correct the tree 
distances d mr ( d mr = 1 — pmr) by the heterogeneity in 
the degrees of metabolites. We renormalize them accord- 
ing to Eq. ([3| to produce the connection probabilities p mr 
that we use in the construction of the confidence scores. 



Appendix B: Alternative methods 

We compare predictions of our TDB model with those 
produced by other alternatives like the Configuration 
Model for Bipartite networks [261 l35l 136] (CMB), the 
Hierarchical Random Graph model [29J generalized for 
bipartite networks (HRBG) (values of p mr are taken di- 
rectly for Pmr), and a local similarity measure counting 
Common Neighbors [37J (CN). 

1. The Configuration Model for Bipartite networks 

(CMB) assumes a certain number of reactions R, a 
certain number of metabolites M, and their degree dis- 
tributions P(k r ) and P(/c m ), which should fulfill the re- 
quirement (k r ) R = (k m ) M, where (k r ) is the average 
number of metabolites in the reactions and (k m ) is the 
average number of reactions in which metabolites partic- 
ipate. Metabolites and reactions are partitioned into two 
different classes and each element in each class is assigned 
an expected degree from the corresponding distribution, 
which is attached in the form of stubs. Two stubs, one in 
each partition, are selected at random and the link be- 
tween the metabolite and the reaction is created avoiding 
multiple connections. 

For the CMB, p mr = fffjR an< ^ Ec l- Q can ^ e emu- 
lated analytically. Since the distribution of the bipartite 
degrees of the reactions is nearly homogeneous, k r « (k r ), 
it becomes 

-v^nln(i-f). (B« 

that gives the CMB confidence score for a metabolic re- 
action when its set of associated metabolites is v. 
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2. Common Neighbors 

(CN) measure counts the number of shared neigh- 
bors of a given pair. This measure represents a fam- 
ily of overlap measures quantifying similarity between 
nodes and a normalized version was specifically intro- 
duced for the study of the hierarchical modularity of 
metabolic networks and to delineate the functional mod- 
ules based on the network topology [25]. In the case of 
bipartite metabolic networks, we define common neigh- 
bors for a pair of metabolites as the number of reac- 
tions in which they are concurrent, o mm /, and we esti- 
mate the probability of connection metabolite-reaction 

as Pmr X/m'Cr °rnm' / X^Vm' °rnm' - 

Appendix C: E. coli bipartite network representation 

In order to build a bipartite network representation 
of the metabolism of E. coli, we use the iAF1260 ver- 
sion of the K12 MG1655 strain [34J provided in the 
BIGG database (http://bigg.ucsd.edu/). It comprises 
1039 metabolites and 2381 reactions which include iso- 
merizations, exchange reactions, intracompartment reac- 
tions, and different types of transport reactions (some 
of which involve chemical transformation) between three 
different compartments: cytosol, periplasm, and a third 
symbolic one representing the extra-organism. The most 
simple representation is in the form of an unweighted 
undirected network without self-loops or dangling ends 
(dead end reactions). More refined versions would take 
into account directionality of the reactions, stoichiomet- 
ric coefficients, self-loops, etc.. 

Inside compartments, isomerization reactions (some 
reversible and some irreversible) transform the structure 
of one compound without altering its molecular formula. 
Isomers can have significantly different properties, so in 
principle it seems reasonable to consider those reactions 
in a network representation, and the two compounds as 
separate. However, the topological confidence score of 
those reactions will depend exclusively on the joint prob- 
ability of occurrence of the pair of isomers and, most 
probably, the score would be expected to be low since 
they only enter together into those reactions. Our option 
(to be consistent with the treatment of diffusion reactions 
that lead to the same problematic as we explain below) 
is to neglect those reactions and to take the isomers as a 
single entity. 

Exchange reactions represent the exchange of metabo- 
lites between the cell and the environment. These are 
reversible and involve a single metabolite. In a bipar- 
tite representation, exchange reactions would be dangling 
ends with a single incoming connection. Sinks needed to 
allow metabolites to leave the system (irreversible reac- 
tions in the cytoplasm) can be treated in the same way. 
Since they would count in the total amount of reactions 
but never two metabolites would enter them simultane- 
ously, it seems reasonable to neglect them in the context 



of this work. Furthermore, exchanges or sinks have no as- 
sociated reconstruction confidence score in the database. 

Regarding transport reactions, there are several op- 
tions. In E. coli, three different compartments are differ- 
entiated: cytosol, periplasm, and extra-organism. Every 
compound in a different compartment is considered as an 
individual specie and transport steps are formally consid- 
ered as reactions transferring the compound belonging 
to one compartment into the same compound belonging 
to the other compartment (the respective concentrations 
can be different, and the compartments usually have dif- 
ferent volume). Transport reactions between compart- 
ments are of different kind. Basic general mechanisms 
are: 

• Via diffusion (membrane permeable to water 
molecules and a few other small, uncharged, 
molecules like oxygen and carbon dioxide that dif- 
fuse freely in and out of the cell) or via facilitated 
diffusion (transmembrane proteins create a water- 
filled pore through which ions and some small hy- 
drophilic molecules can pass by diffusion. The 
channels can be opened or closed according to 
the needs of the cell). Molecules and ions move 
spontaneously down their concentration gradient. 
No chemical transformation is involved. In the 
database, most are reversible processes while some 
are irreversible. 

• Via active transport. Active transport is defined 
as mass transport from a region of lower to a re- 
gion of a higher electrochemical potential. It is 
transport that requires energy. Active transport 
across biological membranes occurs via enzymes. 
Transmembrane proteins, called transporters, use 
the energy of ATP (hydrolyzation of ATP, in gen- 
eral ATP + H20- > ADP + Pi, AG° ATP = 
— 30, 3KJ/mol or —7.3kcal/mol, enough to pump 
2 sodium ions) to force ions or small molecules 
through the membrane against their concentration 
gradient. Active transport can be direct or indirect: 

— Direct active transport. Some transporters 
bind ATP directly and use the energy of 
its hydrolysis to drive active transport (ATP 
powered pumps). In the database, these 
are named as ABC (ATP-Binding Cassette) 
transmembrane proteins, which expose a 
ligand-binding domain usually restricted to a 
single type of molecule at one surface and 
an ATP-binding domain at the other surface; 
the ATP bound to its domain provides the 
energy to pump the ligand across the mem- 
brane. In the database, hydrolyzation of 
ATP correspods to the chemical transforma- 
tion atp[c] + h2o[c]— > adp[c] + h[c] + pi[c] 
(cytoplasm, periplasm). 

— Indirect active transport. A discrete class 
of proteins import or export ions and small 
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Figure 5: Some type of reactions and its topological bipartite representation in E. Coli metabolic network. 



molecules, such as glucose and amino acids, 
against a concentration gradient. These 
proteins use the energy stored in the elec- 
trochemical gradient of a directly-pumped 
ion to power the uphill movement of an- 
other substance, a process called cotrans- 
port. Direct active transport of the ion es- 
tablishes a concentration gradient. When 
this is relieved by facilitated diffusion, the 
energy released can be harnessed to the 
pumping of some other ion or molecule. 
Symport pumps: the driving cotransported 
ion (H+,Na+) and the transported molecule 
pass through the membrane pump in the 
same direction. The driving ions flow down 
their concentration gradient while the coupled 
molecules are pumped up theirs; later the ion 
is pumped back out of the cell by a direct 
active transport process. Antiport pumps: 
the driving ion (again, usually sodium, 
proton / phosphate / succinate / nitrite / ) diffuses 
through the pump in one direction providing 
the energy for the active transport of some 
other molecule or ion in the opposite direc- 
tion. In any case, no direct chemical change 
in the database. 

In the database, other transport reactions also appear. 
Some do not involve chemical transformation like for indi- 
rect active transport: via ton system, or secretion (with- 
out known transport mechanism). A second class involve 
chemical transformation of metabolites along transport: 



pep-pyr system, oxidative phosphorylation, vectorial Co- 
A coupling, nitrogen metabolism, etc.. 

In relation to the bipartite network representation, 
one needs to differentiate the different transport mecha- 
nism: diffusion, transport without chemical transforma- 
tion, and transport involving chemical transformation of 
carriers (see Fig. 5). 

Diffusion reactions do not transform any compound 
and involve the same metabolite both as input and out- 
put (self-loops in a metabolite one- mode projection), and 
so require some specific treatment. One possibility is to 
consider the metabolites in both parts of the reaction as 
different entities, the effective pair. Since this is the only 
pair entering such reactions, the topological score is ex- 
pected to be low as for isomerization reactions. However, 
since the metabolite is the same with the same proper- 
ties, a better option in this case is to neglect these reac- 
tions and do not differentiate the metabolite in different 
compartments. 

Other transport reactions without chemical transfor- 
mation involve more than one metabolite, usually two 
that change compartment. In topological terms, this is 
a situation analogous to diffusion and the same treat- 
ment of ignoring the reaction and taking the metabolite 
in different compartments as the same can be applied. 

Transport reactions that need to transform metabolites 
to transfer a compound have different metabolites enter- 
ing and leaving the reaction except the compound that 
is transported across compartments. To be consistent 
with the previous treatments, that metabolite in differ- 
ent compartments would be treated as a single entity. 
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rank k. 



Figure 6: Rank distribution of degrees of metabolites 
and reactions and cumulative bipartite degree distri- 
butions. The degree of a metabolite is taken as the number 
of reactions it participates in and the degree of a reaction is 
the number of different metabolites it involves. 




Figure 7: Average nearest neighbors bipartite degree 
of metabolites and reactions. 



Finally, chemical transformation reactions that hap- 
pen in more than one compartment (or that need some 
component of a different compartment from where it is 
happening) can be treated as chemical transformation 
reactions inside compartments. 

In summary (see Table 1), we take metabolites as the 
same entity independently of the compartment and we 
obviate isomerizations, exchange, and diffusion reactions. 
This leaves a total of 1479 reactions out of 2381 that in- 
volve 976 metabolites out of 1039 (isomers have been 
identified with the same id). Apart from the identifica- 
tion of isomers, 5 metabolites have been removed (mn2 
(id 30), ca2 (id 60), nal (id 383), ag (id 651), and cl (id 
654)) because they do not enter in any transformation re- 
action. They seem to be necessary to transfer compounds 
(typically h but also others) across compartments. 

In Fig. 6, we show the cumulative bipartite degree dis- 
tribution both for metabolites and reactions. The degree 
of a metabolite is taken as the number of reactions it par- 
ticipates in and the degree of a reaction is the number of 
different metabolites it involves. The number of metabo- 
lites entering into a reaction k r follows a homogeneous 
distribution with mean < k r >= 4.82 and mode 5. In 
contrast, the number k m of reactions in which a metabo- 
lite participates displays a scale free degree distribution 
P(km) ~ ^m 21 with an average degree < k m >= 7.30. 
Currency metabolites are the most connected substrates, 
some with more than a hundred and up to 841 connec- 
tions (/i + , /i2o, atp, pi, adp, ppi, nad, nadh). 

The average nearest neighbors bipartite degree of 
metabolites and reactions is displayed in Fig. 7. The ris- 
ing of the average levels for higher values of the degrees 
means that the reactions involving more metabolites also 
involve more carrier metabolites (hubs), compartments 
and also enter in exchange reactions. 
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Table I: Classification of reactions in the database according to compartment implications, function and topological structure. 
The total number of reactions R in the database is 2381. 



Type 


Classification of reactions 

ID Description 


in E. Coli metabolic network 

Number of reactions 


Included in graph 




1 cytosol 


1100 


Yes 




2 isomerization in cytoplasm 


59 


No 


internal 


3 periplasm 


190 


Yes 




4 isomerization in periplasm 


2 


No 




5 extra-organism 


8 


Yes 



exchange 



6 sinks in cytosol 

7 exchange in extra-organism 



5 

299 



No 
No 



transport 



8 diffusion, facilitated diffusion, via vector, channel, flip- 
ping 

9 ABC system, direct active transport (+1 detoxification) 

10 symport, +1 similar reaction ID 11690 

11 antiport, +6 similar reactions Ids 11407, 11409, 11411, 
11413, 11415, 11835 

12 ton system 

13 secretion (transport mechanism not known) 

14 reactions with transformation involving different com- 
partments (pepipyr, oxidative phosphorylation, vectoral 
Co-A coupling, nitrogen metabolism, etc.) 



346 

124 
124 
50 

11 

6 
57 



No 

Yes 
No 
No 

No 
No 
Yes 



